ABSTRACT

We give a test for protein coding regions which is based on simple and
universal differences between protein-coding and noncoding DNA. The test is
simple enough to use without a computer and is completely objective. The test
has been thoroughly tested on 400,000 bases of sequence data: it misclassifies
5% of the regions tested and gives an answer of "No Opinion" one fifth of the
time. We predict some new coding and noncoding regions in published
sequences.

INTRODUCTION 

For several years there has been a well-known and very general need
for a way to distinguish a true protein-coding sequence (PCS) from a merely
fortuitous open reading frame (ORF) in known DNA sequences. The need arises
mainly when a gene location is only approximately known at the start of
sequencing, and the sequence turns out to have more than one candidate ORF.
Even when a gene has been located the surrounding sequence may contain other
ORF's of unknown character, and a method to distinguish the true PCS's among
these yields a powerful tool for the discovery and characterization of new
proteins.

We set ourselves the task of finding an objective and self-contained
test (or decision procedure) which, when presented with a DNA sequence, would
classify it as either coding or noncoding (in this paper "coding" will always
mean "coding for protein"). Later we decided to allow the test the option of
refusing to classify an occasional sequence. To be of practical value such a
test should not depend on the subjective evaluation of results by the user,
and should have been checked on a large number of sequences so as to be of
known reliability. We chose to look for a test depending on the overall
statistical properties of the base sequence rather than on specific
transcription or translation initiation signals for two reasons. First,
initiation signals may be unavailable. This happens frequently when the 5'
end of an interesting ORF is not included in the known sequence. It can also
happen that a PCS has no initiation signals at all: cf. for example the lysis
gene of phage MS2, which is only translated upon readthrough of the stop codon
of the previous gene (1), and the yeast mitochondrial introns which code for
protein (reviewed in Ref. 2). Second, the problem of precisely characterizing
what is and what is not an initiation signal still looks extremely difficult.
We also chose to find a test which would give a simple coding/noncoding answer
for a specific region, rather than trying to map all coding and noncoding
regions in a large sequence at once. This makes it easier to do meaningful
large-scale reliability testing. Also, though our test is not adapted to
finding the exact boundaries of coding regions, it is very well adapted for
combination with other relevant algorithms, such as searches for ORF's,
ribosome binding sites, intron boundaries, etc.

Four papers have appeared in the last year which describe statistical
patterns which are probably characteristic of coding regions in general. All
of these patterns have the potential of forming the basis for a useful
coding/noncoding test. However we believe that ours is the first paper to
give a fully specified and objective test, checked on a large number of
sequences. Shulman et al. found (3) patterns in the coding regions of two
phage that pointed to the three letter code and to the correct reading frame.
However their sample was very small, and they did not investigate the
predictive power of their observations. J. C. W. Shepherd, in researching the
origin of the genetic code, found (4) periodicities in the autocorrelation
functions of single bases and doublets in DNA, and applied this (5) to the
problem of discovering the reading frame of a PCS. Though interesting
patterns are found, no specific coding/noncoding test is given, and no
evidence is presented that noncoding DNA always lacks the patterns supposedly
characteristic of coding DNA. Staden and McLachlan have written (6) a
computer program for mapping the PCS's in a sequence by measuring the
similarity of the codon usage strategy between a known PCS and the ORF under
test. The method requires that the PCS used as a standard be closely related
(in codon usage patterns) to any PCS discovered. This makes the method highly
dependent on the judgement of the user, and may make it inapplicable in some
cases.

Another, more popular, vein of research is in trying to characterise the
signals for initiation of transcription and translation by which the cell
itself recognizes a PCS. For reasons given above we consider this a separate
problem, complementary to the one we are considering, and only refer the
reader to the surveys of Gold et al. (7) and Breathnach and Chambon (8), and
to the recent computer program of Rodier et al. (9).

CHARACTERISTIC PARAMETERS OF CODING AND NONCODING REGIONS

Many people have noticed patterns, or statistical order, in PCS's, but
for the most part it has not been shown that these patterns consistently fail
to appear in noncoding DNA. In this section we will give a striking
illustration to show that some of the order in PCS's is in fact characteristic
of coding regions, and will then define some numerical parameters of sequences
whose distributions reveal universal differences between coding and noncoding
DNA.

All studies reported here are based on sequence data stored in the Los
Alamos Sequence Library, a public databank on the CDC 7600 computers at Los
Alamos National Laboratory, currently listing 486,000 bases in 320 sequences.
A description of the databank (including references for the sequences) is
given in Ref. 10. Each sequence in the library was divided into its coding
and noncoding parts, based on the experimental evidence reported by the
original authors; sections of sequence for which this information was
incomplete were not used. In early experiments we found that sequences under
200 bases (a somewhat arbitrary limit, considered further below) were too
small to give reliable results. So for our primary data we took 321 fragments
of coding DNA (230877 bases) and 249 fragments of noncoding DNA (158987
bases), each at least 200 bases long. (Thus a coding/noncoding decision made
by the test given in this paper is based on the data in the Los Alamos
Sequence Library. But we will show that our method is general and can be
based on any collection of sequence data.)

Underlying all observations of statistical order in PCS's is the fact
that codons are used with unequal frequency (for data and review see the work
of Grantham et al. (11-13)). One consequence of this fact, which has been
noted several times (3-5,14,15), is that oligonucleotides (and in particular
nucleotides) tend to be repeated with a periodicity of three in a PCS.
Figure 1 shows the autocorrelation function for thymine in the coding and
noncoding parts of the Los Alamos Library (we ignore the distinction between
RNA and DNA throughout the paper, so T and U are considered synonymous). The
first graph shows that in coding sequences the number of bases separating two
T's is much more likely to be 2,5,8,11,... (2+3n) than it is to be 3n or
1+3n. I.e. in coding sequences identical bases are most often found in
identical codon positions. The second graph shows that this regularity is
absent in noncoding sequences.
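
The comparison behind such an autocorrelation graph can be sketched in a few
lines of Python. This is only an illustration, not the program used for this
paper: the function name autocorrelation_excess and its arguments are
hypothetical, it treats a single sequence rather than a pooled library, and it
assumes the independent-base model in which the expected count for separation
k is the number of blocks of k+2 nucleotides times the square of the overall
frequency of the base.

```python
def autocorrelation_excess(seq, base="T", max_k=24):
    """Percent difference between observed and expected counts of two
    `base` letters with exactly k nucleotides between them, k = 0..max_k.

    Expected count (independent-base model, an assumption of this sketch):
    number of windows of length k + 2 times the squared frequency of `base`.
    """
    seq = seq.upper().replace("U", "T")  # T and U are treated as synonymous
    n = len(seq)
    freq = seq.count(base) / n if n else 0.0
    result = {}
    for k in range(max_k + 1):
        windows = n - (k + 1)  # number of blocks of k + 2 nucleotides
        if windows <= 0 or freq == 0.0:
            break
        observed = sum(
            1
            for i in range(windows)
            if seq[i] == base and seq[i + k + 1] == base
        )
        expected = windows * freq * freq
        result[k] = 100.0 * (observed - expected) / expected
    return result
```

In a sequence with strict period three, such as "TAA" repeated many times, the
value at k = 2 is strongly positive while the values at k = 0 and k = 1 are
negative, which is the shape of the wave described for coding sequences.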
FIGURE 1. Autocorrelation graphs for T (thymine) in the 321 coding and 249
noncoding fragments over 200 bases long in the Los Alamos Sequence Library.
Top: For each possible separation k, we counted, in all coding fragments in
the Library (a total of 231 kilobases), the number of times two T's appeared
with k nucleotides between them, and compared this with the count expected in
a model where bases are chosen independently - namely the number of blocks of
k+2 nucleotides times the square of the overall T-content of the coding
regions. The percent difference is graphed for k running from 0 to 24 and
from 147 to 198. Bottom: The same for the noncoding regions (159 kilobases).
The wave so conspicuous for the coding regions is absent here. Findings were
similar for the other three bases, and for pairs of unlike bases. The high
values near the beginning of the noncoding graph are probably due to AT
clustering; otherwise the two graphs have about the same average value.

We now turn to the definition of eight numerical parameters of DNA
sequences which we use to distinguish coding from noncoding regions. The
first four parameters, motivated by Figure 1, measure the asymmetry in the
distribution of each base among the three codon positions (or the analogous
positions in a noncoding sequence). Let

     A_1 = Number of A's in positions 1,4,7,10,...
(1)  A_2 = Number of A's in positions 2,5,8,11,...
     A_3 = Number of A's in positions 3,6,9,12,...

and similarly for C, G and T. Then define

(2)  A-Position = max(A_1, A_2, A_3) / (min(A_1, A_2, A_3) + 1)

and similarly for C, G and T.

The parameters A-, C-, G- and T-Position measure the degree to which each
base is favored in one codon position over another. Note that it is
irrelevant which of the three codon positions favors the base; it is only the
degree to which the base is favored that is measured - this property gives
these four parameters fairly similar distributions in all sequences,
regardless of the well known differences in codon usage strategy between
organisms.

The other four parameters we use are just the A-, C-, G- and T-Content of
the sequence (i.e. the percentage of the sequence contributed by each of four
bases). Note that, as a practical matter, the counts A_1 etc. made in the
calculation of the Position parameters yield immediately the Content
parameters also.
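
As an illustration, the eight parameters can be computed from one pass over
the sequence. The sketch below is not the original program; the function names
are hypothetical. It assumes that the Position parameter is the largest of the
three positional counts divided by the smallest, with 1 added to the
denominator so that a zero count cannot cause a division by zero, and it
expresses Content as a fraction (0 to 1) rather than a percentage.

```python
def position_counts(seq):
    """Count each base at sequence positions 1,4,7,... / 2,5,8,... / 3,6,9,..."""
    seq = seq.upper().replace("U", "T")  # T and U are treated as synonymous
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(seq):
        if base in counts:  # silently skip ambiguity codes such as N
            counts[base][i % 3] += 1
    return counts


def eight_parameters(seq):
    """Return the four Position and four Content parameters of a sequence."""
    counts = position_counts(seq)
    total = sum(sum(c) for c in counts.values())
    params = {}
    for base, c in counts.items():
        # Assumed form of the Position parameter: max count over
        # (min count + 1); the "+ 1" is an assumption of this sketch.
        params[base + "-Position"] = max(c) / (min(c) + 1)
        params[base + "-Content"] = sum(c) / total if total else 0.0
    return params
```

The same positional counts serve both families of parameters, which is the
practical point made above: no second pass over the sequence is needed for the
Content values.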

The relative distribution of these eight parameters between coding and
noncoding fragments is shown in Table 1. All eight parameters will be used in
a single test in the next section, but note that even in the distribution of
individual parameters the differences between coding and noncoding DNA are
evident. For example among fragments having a T-Position parameter less than
1.2 (this includes about one fourth of all fragments) there is only a 9%
probability of coding function, while among fragments with T-Position
parameter over 1.7 (again about one fourth of the total) the probability of
coding is over 90%. Table 1 contains all the information about these
parameters needed for our decision procedure. The full distributions of the
eight parameters, of interest in their own right, are given in Figure 2 and
discussed further below.

HOW TO DISTINGUISH CODING FROM NONCODING SEQUENCES

In the last section we gave the distribution of our eight test
parameters. Next we will assign weights to each parameter, telling how much
attention we should pay to it in making the final coding/noncoding decision.
The parameter distribution and weights should need to be recalculated only
very occasionally as more sequence data accumulates. Users of the
coding/noncoding test will only need to do a very simple calculation detailed
below.

From Table 1 it is clear that, for example, the T-Position parameter of a
sequence usually tells one a good deal more than its A-Content. To get a
number telling us how much input each parameter should have in the final
decision, we used each parameter alone to predict coding function, as follows:
if a sequence fell in an interval where the probability of coding (from Table
1) was greater than one half the sequence was called coding, otherwise not.
(I.e. if more coding than noncoding fragments share this parameter value with
the fragment in question, we guess it is coding.) The weight for a given
parameter is just the percentage of the time that this guess was correct, less
50% (random level). The weights for each of the eight parameters are shown in
Table 2. In giving these weights we are not making any important claim about
these parameters; rather we are just deciding how to use them in our specific
decision procedure.

TABLE 1

Characteristic Parameters of Coding and Noncoding Sequences

Position Parameter        Probability of Coding
                          A      C      G      T
0.0  to  1.1             .22    .23    .08    .09
1.1  to  1.2             .20    .30    .08    .09
1.2  to  1.3             .34    .33    .16    .20
1.3  to  1.4             .45    .51    .27    .54
1.4  to  1.5             .68    .48    .48    .44
1.5  to  1.6             .58    .66    .53    .69
1.6  to  1.7             .93    .81    .64    .68
1.7  to  1.8             .84    .70    .74    .91
1.8  to  1.9             .68    .70    .88    .97
1.9  to  2.0+            .94    .80    .90    .97

Content Parameter         Probability of Coding
                          A      C      G      T
.00  to  .17             .21    .31    .29    .58
.17  to  .19             .81    .39    .33    .51
.19  to  .21             .65    .44    .41    .69
.21  to  .23             .67    .43    .41    .56
.23  to  .25             .49    .59    .73    .75
.25  to  .27             .62    .59    .64    .55
.27  to  .29             .55    .64    .64    .40
.29  to  .31             .44    .51    .47    .39
.31  to  .33             .49    .64    .54    .24
.33  to  .99             .28    .82    .40    .28

TABLE 1. The values of the eight parameters, A-, C-, G- and T-Position and
A-, C-, G- and T-Content, were calculated for each of the 321 coding and 249
noncoding fragments over 200 bases long in the Los Alamos Sequence Library
(see text). The range of each parameter was divided into ten intervals as
shown (we use these same intervals for any collection of sequence data). For
each interval the percentage of coding and noncoding fragments whose parameter
fell therein was recorded. The value "Probability of Coding" shown is the
percentage of coding fragments falling in the interval, divided by the
percentage of coding plus the percentage of noncoding. This is essentially
the fraction of all fragments falling in the interval which are coding, but
differs slightly because more coding than noncoding fragments are used.

6308    Nucleic Acids Research

FIGURE 2 (panels: POSITION, CONTENT). The distribution of the Position and
Content parameters for coding (heavy bars) and noncoding (light bars)
fragments. See the legend of Table 1 for details.

TABLE 2

Weight to be Given to the Individual Parameters

         Position    Content
A          .26         .11
C          .18         .12
G          .31         .15
T          .33         .14

TABLE 2. The weight shown is the percentage of the time (above 50%, the
random level) that each parameter alone successfully predicted coding or
noncoding function.
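
The calibration just described, from raw parameter values to a "Probability
of Coding" per interval and a weight per parameter, might be sketched as
follows. This is an illustration under stated assumptions, not the original
program: the function names are hypothetical, intervals are taken as closed
on the left, values beyond the last boundary are placed in the last interval,
and "the percentage of the time that this guess was correct" is taken over
all fragments, coding and noncoding together.

```python
def interval_index(value, edges):
    """Index i of the interval [edges[i], edges[i+1]) containing value.
    Values at or above the last edge fall in the last interval."""
    idx = 0
    for i, lower in enumerate(edges[:-1]):
        if value >= lower:
            idx = i
    return idx


def probability_of_coding(coding_vals, noncoding_vals, edges):
    """Per interval: share of coding fragments divided by
    (share of coding fragments + share of noncoding fragments)."""
    n = len(edges) - 1
    cod = [0] * n
    non = [0] * n
    for v in coding_vals:
        cod[interval_index(v, edges)] += 1
    for v in noncoding_vals:
        non[interval_index(v, edges)] += 1
    probs = []
    for c, m in zip(cod, non):
        pc = c / len(coding_vals)
        pn = m / len(noncoding_vals)
        probs.append(pc / (pc + pn) if pc + pn > 0 else None)  # None: empty
    return probs


def parameter_weight(coding_vals, noncoding_vals, edges, probs):
    """Fraction of fragments guessed correctly by this parameter alone,
    less 0.5 (the random level)."""
    correct = 0
    for v in coding_vals:
        if probs[interval_index(v, edges)] > 0.5:
            correct += 1
    for v in noncoding_vals:
        if probs[interval_index(v, edges)] <= 0.5:
            correct += 1
    return correct / (len(coding_vals) + len(noncoding_vals)) - 0.5
```

Dividing by the two shares rather than by raw counts is what makes the
probability differ slightly from the plain fraction of coding fragments when
the two groups are of unequal size.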

We can now describe TESTCODE, our algorithm for predicting whether a
fragment of DNA is coding or not. Given a fragment of DNA, first make the
counts A_i, C_i, G_i and T_i, i = 1, 2, 3 (equation (1)). From these calculate
the eight parameters A-, C-, G- and T-Position (equation (2)), and A-, C-, G-
and T-Content of the fragment. For each of these parameters look up the
"Probability of Coding" value in Table 1; call these probabilities
p_1, ..., p_8. Let the corresponding weights, given in Table 2, be denoted
w_1, ..., w_8. The sum p_1 w_1 + ... + p_8 w_8 is the TESTCODE indicator of
coding function. Its distribution in the Los Alamos Library, and the
predictions corresponding to its different values, are shown in Table 3. (A
more familiar way to combine the information from the eight parameters would
be to use Bayes' formula. But in using Bayes' formula we assume that the eight
parameters are independent, which of course is not the case. So it is not
surprising that the method given above worked a little better.)

RELIABILITY OF THE METHOD

From Table 3 it is clear that TESTCODE correctly predicted the function
of all but a few of the fragments used in the study. However since we used
these same fragments to calculate the parameter distributions which TESTCODE
uses, one might object that perhaps the algorithm was just "remembering"
special properties of the Los Alamos collection, and would be less reliable
for distinguishing coding and noncoding DNA in general. To take care of this
objection we divided the Los Alamos Library into two parts, calculated the
distribution of our eight parameters on one half, and used this information to
predict which fragments in the other half coded for protein. There was only a
5% error rate in these predictions, showing that TESTCODE is almost certainly
based on universal differences between coding and noncoding DNA, independent
of the Los Alamos collection.

TABLE 3

Distribution of the TESTCODE Indicator

TESTCODE Indicator    Probability of Coding    Prediction
0.32  to  0.43              0.00               Noncoding
0.43  to  0.53              0.04               Noncoding
0.53  to  0.64              0.07               Noncoding
0.64  to  0.74              0.29               Noncoding
0.74  to  0.84              0.40               No Opinion
0.84  to  0.95              0.77               No Opinion
0.95  to  1.05              0.92               Coding
1.05  to  1.16              0.98               Coding
1.16  to  1.26              1.00               Coding
1.26  to  1.37              1.00               Coding

TABLE 3. The distribution of the TESTCODE indicator, our predictor of coding
function, is shown on all the 321 coding and 249 noncoding fragments used in
this study. "Probability of Coding" is calculated just as in Table 1. The
last column gives the TESTCODE prediction of function for a fragment whose
indicator value falls in the corresponding interval. In calibrating TESTCODE
on any set of sequence data there is always a natural cutoff point (in this
case .84) above which every interval contains more coding than noncoding
fragments, and below which every interval contains more noncoding than coding
fragments. We always make the two intervals flanking this cutoff the "No
Opinion" range.

In more detail our procedure was as follows: We numbered the coding
fragments from 1 to 321 and the noncoding from 1 to 249. We then calculated
the relative distribution of our eight parameters, as in Table 1, and the
weights to use with them, as in Table 2, but using only the odd-numbered
fragments as our data set. We then used the resulting parameter distributions
to calculate a TESTCODE indicator for each of the even-numbered fragments.
The range of the indicator was divided into 10 equal intervals, as in Table 3.
Any fragment whose indicator fell in the top four intervals was judged coding,
any in the bottom four noncoding, and in the middle two intervals no answer
was given. The TESTCODE prediction was "No Opinion" on 18% of the fragments.
6% of the coding segments were judged incorrectly as "Noncoding", and 3% of
the noncoding segments were judged incorrectly as "Coding". The actual
distribution of the TESTCODE indicator is given in Figure 3.
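
The scoring step of this reliability test can be sketched as follows. The
sketch is illustrative only: the function name evaluate_split and its
arguments are hypothetical, it takes as given the indicator values already
computed for the held-out (even-numbered) fragments, and it assumes that the
ten equal intervals span exactly the observed range of those values.

```python
def evaluate_split(indicators, labels, n_intervals=10, n_middle=2):
    """Score held-out fragments.

    indicators: TESTCODE indicator of each held-out fragment.
    labels:     True if the fragment is known to be coding, else False.
    Returns the "No Opinion" rate and the two misclassification rates.
    """
    lo, hi = min(indicators), max(indicators)
    width = (hi - lo) / n_intervals
    n_side = (n_intervals - n_middle) // 2  # four intervals on each side
    low_cut = lo + n_side * width           # below this: Noncoding
    high_cut = hi - n_side * width          # at or above this: Coding

    no_opinion = coding_missed = noncoding_missed = 0
    n_coding = sum(1 for is_coding in labels if is_coding)
    n_noncoding = len(labels) - n_coding
    for x, is_coding in zip(indicators, labels):
        if x >= high_cut:
            call = "Coding"
        elif x < low_cut:
            call = "Noncoding"
        else:
            call = "No Opinion"
        if call == "No Opinion":
            no_opinion += 1
        elif is_coding and call == "Noncoding":
            coding_missed += 1
        elif not is_coding and call == "Coding":
            noncoding_missed += 1
    return {
        "no_opinion_rate": no_opinion / len(labels),
        "coding_judged_noncoding": coding_missed / n_coding if n_coding else 0.0,
        "noncoding_judged_coding": (
            noncoding_missed / n_noncoding if n_noncoding else 0.0),
    }
```

The two error rates are reported separately for coding and noncoding
fragments, matching the way the 6% and 3% figures are stated above.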

FIGURE 3. Results of the reliability test for TESTCODE. After the TESTCODE
indicator was calculated for all even fragments the range of the indicator was
divided into ten equal intervals, whose endpoints are marked on the abscissa.
The percentage of coding (shaded bars) and noncoding (open bars) fragments
whose TESTCODE indicator fell in each interval is graphed. The "No Opinion"
range is boxed.

In the future, when a larger sample of sequences is available, it may be
worthwhile to use separate data sets when using TESTCODE on fragments from
different taxonomic classes. For example when we ran the kind of reliability
test just described using only vertebrate nuclear sequences, we found that
TESTCODE returned "No Opinion" on only 12% of the fragments used, and only
misclassified 3%. For the vertebrate study we used 82 coding and 102
noncoding fragments; other taxonomic groups are still rather small for this
kind of reliability test.

Throughout this study we have restricted attention to fragments over 200
bases long. It turns out that in fact TESTCODE's reliability is unacceptable
on shorter fragments. When we used TESTCODE (just as specified in the
preceding section) to predict the function of the 57 noncoding and 159 coding
fragments in the library between 100 and 199 bases in length, the predictions
were incorrect 13% of the time, and the "No Opinion" rate was 29%. 200 bases
seems to be a reasonable minimum, for when predictions were made in the length
ranges 200-299, 300-399, 400-499, 500-599, 300+ and 600+ the error rate was
always close to 5%. The chief effect of the length, above 200 bases, seems to
be on the "No Opinion" rate, which is 24% for fragments of 200-299 bases, but
under 15% for longer fragments.

PREDICTION OF CODING AND NONCODING REGIONS IN PUBLISHED SEQUENCES 

We have scanned the Los Alamos Sequence Library for ORF's not associated
with a known protein, and have rated them all with TESTCODE. In this section
we give a few of our more interesting findings. Our predictions are
summarized in Table 4; further comments on some are given below. A general
experimental method for identifying the protein product of any ORF, if it
exists, has been given (16), so these predictions provide a way to assess the
usefulness of TESTCODE as an exploratory tool.

TABLE 4

Predicted Coding and Noncoding ORF's

Organism      Reported Sequence       Ref.      Open Frame         Prediction
Adenovirus?   Transforming Region     (17)      402  to 166**      Coding (.92)
A. nidulans   Cytochrome B            (18)      507  to 713+*      Coding (.98)
E. coli       Insertion Element 1     (19,20)   250  to 753*       Coding (.98)
                                                56   to 331 f      No Opinion (.77)
E. coli       Origin of Replication   (21-24)   734  to 291**      Coding (.92)
                                                1282 to 824        Coding (1.0)
E. coli       Ribosomal Operon B      (25-28)   275  to 1144 #     Coding (.98)
                                                2699 to 2959.      No Opinion (.77)
                                                6916 to 7506'      Coding (1.0)
Human         6-hemoglobin            (29)      1493 to 1810       Coding (.98)
Yeast         18S rRNA                (30)      1349 to 1149*      Coding (.98)
Yeast         2µ plasmid              (31,32)   5570 to 523 L      Noncoding (.29)
                                                2008 to 887 *5     No Opinion (.77)
                                                5198 to 4308/      Coding (.98)
                                                2271 to 2816*      No Opinion (.40)
                                                6258 to 5905 f     Noncoding (.04)

TABLE 4. As far as we know none of the ORF's listed here has been shown to be
coding or noncoding. Numbering of the sequence is as in the (first of the)
reference(s) cited. The ranking by TESTCODE (from Table 3) is given in the
last column.

*Complementary strand from that given in reference.
.No start codon.

Possible coding function suggested in reference.

The gene products of the Adenovirus transforming region are of great
interest, yet we have seen no mention of the Adenovirus ORF listed in Table 4.
Although it has no start codon, it might be spliced with other ORF's upstream
on the same strand.

It has been shown that the box3 intron of yeast cytochrome b codes for a
protein maturase, and other yeast mitochondrial introns are suspected of
coding (reviewed in Ref. 2). Waring et al. (18) have shown that the
situation in Aspergillus nidulans is similar to that in yeast; the single
intron in the cytochrome b gene of A. nidulans has a long ORF which continues
in phase with the previous exon. Since the probability, according to
TESTCODE, that this ORF codes is .98, it looks very likely that coding introns
will be found in organisms other than yeast.

There is considerable interest in protein products which may be coded by
movable DNA elements and which may help to insert and excise them. TESTCODE
ranks very highly one long ORF in Insertion Element 1 of E. coli. Ohtsubo et
al. (20) have sequenced an analogue of this insertion element in Shigella
dysenteriae and have shown that in this ORF (and another which is ranked
ambiguously by TESTCODE) many more of the differences from Insertion Element 1
occur in third codon position than in first or second - a strong indication
that both ORF's code.

The first ORF listed for the E. coli replication origin has been noted
before, and in fact evidence that supports its probable coding function is
given in Ref. 23. The second ORF listed, however, seems to have escaped
attention.

The 3' flanking regions of many vertebrate genes have short ORF's, partly
overlapping the gene, which rank highly. We include one fairly long one
associated with Human 6-hemoglobin, which is clearly separate from the main
gene. (25% of the designated ORF overlaps the hemoglobin gene. The remaining
75% of the ORF was tested separately and found to have a .92 probability of
coding.)

The possible PCS listed for Yeast 18S rRNA is particularly interesting
because no PCS is known to overlap a ribosomal RNA gene. Many ribosomal RNA
genes in the Los Alamos Library contain long ORF's; the second ORF listed from
the E. coli RRNB operon is another.

We have examined all the ORF's of an important cloning vector, the yeast
2 micron plasmid, and offer our opinion on its overall coding capacity.

WHY TESTCODE WORKS

In this section we show that TESTCODE's success can be understood in
terms of two simple facts: 1) Any kind of consistent non-random codon use
results in uniformly high Position parameters, and 2) Coding sequences have
higher GC-content, on average, than noncoding sequences. We begin by
explaining more fully the connection between codon usage and our Position
parameters.

Suppose we had an organism in which A was suppressed in third codon
position, but shared first and second codon position equally with the other
three bases. Thus the probability that the first base of a codon was A would
be .25, and likewise the second, but the probability that the third base was
A might be only .15. Then in a PCS of length N we would have, approximately,
A_1 = .25N, A_2 = .25N and A_3 = .15N, so that the expected value of
A-Position would be about .25/.15, or 1.7. Now note that if we had another
organism in which third position A was favored instead of suppressed, so that
the probabilities of finding an A in each of the three positions were, say,
.22, .25 and .35, respectively, the expected value of the A-Position parameter
would be .35/.22 = 1.6, a similar value. Thus it turns out that all the very
different coding strategies used by different creatures lead to the same
result - Position parameter values mostly in the range 1.5 to 4.0 (whereas
noncoding fragments have Position values, generally, in the range 1.0 to 1.5).
As we mentioned earlier, this is what makes our one calculation applicable to
all different kinds of sequences.
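
The arithmetic of these two hypothetical organisms can be checked with a very
small sketch. The function name expected_position is hypothetical, and the
sketch assumes a sequence long enough that the Position parameter is
effectively the ratio of the largest to the smallest positional probability.

```python
def expected_position(p1, p2, p3):
    """Approximate Position value for a long sequence, given the probability
    of finding the base in each of the three codon positions."""
    probs = (p1, p2, p3)
    return max(probs) / min(probs)


# A suppressed in third position.
suppressed = round(expected_position(.25, .25, .15), 1)  # 1.7
# A favored in third position.
favored = round(expected_position(.22, .25, .35), 1)     # 1.6
```

Because only the largest and smallest values enter the ratio, it does not
matter which codon position is the favored or the suppressed one.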

To take an actual example, the probabilities of finding an A in each of
the three codon positions in vertebrates are .27, .31, and .15. We would
predict from this an average A-Position parameter of .31/.15 = 2.1, while the
actual average is 3.2. The true average is higher because the PCS's exhibit
stronger codon usage preferences individually than one sees in the overall
average. In the same way the predicted average C-, G- and T-Position
parameter values are 1.6, 1.6 and 1.5 respectively, while the actual averages
are 1.8, 1.9 and 1.9.

As one can see from Table 2, TESTCODE's decision is based mainly on the 
Position parameters. However the base content of the sequence shows some 
clear trends and does contribute a few percent to the reliability. The most 
noticeable trend in the base content data is that the GC-content of coding 
sequences tends to be higher than that of noncoding sequences. 

To test whether these statistical trends really account for TESTCODE's
performance, we generated artificial random "coding" and "noncoding" sequences
and rated them with TESTCODE. For our synthetic "coding" sequences we
generated successive codons independently and at random, with the same
frequencies as genuine vertebrate sequences. (The Library as a whole does not
show strong codon preference rules, so we needed to limit ourselves to a more
internally consistent set of data. There is no reason to think that the
choice of vertebrate instead of, say, E. coli sequences is significant.) For
our "noncoding" sequences we generated successive bases independently and at
random, with frequency .27 for A and T, and .23 for G and C (again the
frequencies of vertebrate sequences). We generated 100 coding and 100
noncoding random sequences, each 600 bases long (the average length of the
real coding and noncoding fragments used). TESTCODE, using the data from real
sequences listed in Tables 1-3, classified only 2% of the random sequences
incorrectly, and gave an answer of "No Opinion" on only 17%.
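
Generating such artificial sequences can be sketched as follows. This is an
illustration, not the original program: the function names are hypothetical,
and since the vertebrate codon frequencies are not reproduced here the codon
table is left as an argument to be supplied by the caller. The small table in
the usage lines is a toy example chosen only to show the mechanics and does
not represent real codon usage.

```python
import random


def random_noncoding(length, rng):
    """Bases drawn independently: .27 for A and T, .23 for G and C."""
    bases = ["A", "C", "G", "T"]
    weights = [.27, .23, .23, .27]
    return "".join(rng.choices(bases, weights=weights, k=length))


def random_coding(length, rng, codon_freqs):
    """Successive codons drawn independently with the given relative
    frequencies; codon_freqs maps codon -> weight (caller-supplied)."""
    codons = list(codon_freqs)
    weights = [codon_freqs[c] for c in codons]
    return "".join(rng.choices(codons, weights=weights, k=length // 3))


# Usage with a seeded generator and a toy (not real) codon table.
rng = random.Random(0)
toy_codon_freqs = {"GCC": 3, "AAG": 2, "CTG": 4, "GAA": 1}
noncoding_seq = random_noncoding(600, rng)
coding_seq = random_coding(600, rng, toy_codon_freqs)
```

Each generated sequence could then be rated with the indicator calculation
described earlier and the predictions tallied over 100 sequences of each
kind.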

SUMMARY AND DISCUSSION 

We have used certain universal differences between protein-coding and
noncoding regions to produce a simple algorithm TESTCODE which distinguishes
coding from noncoding DNA with high reliability. When TESTCODE was calibrated
on one half of the Los Alamos Sequence Library and then used to predict the
coding or noncoding regions in the other half it gave an answer of "No
Opinion" on 18% of the regions tested, and had an overall error rate of only
5%. We have used TESTCODE to predict a number of new coding and noncoding
regions in published sequences.

A method for distinguishing coding from noncoding DNA has a large number
of potential uses. First, after a fragment of DNA known to contain the gene
for a certain protein has been isolated and sequenced, it often turns out to
contain several ORF's from among which one must choose the correct one. A
recent example is the search for the E. coli trpR gene by Singleton et al.
(33). The authors considered three possible ORF's and discovered the correct
one by mutation analysis. TESTCODE rates only the correct one as coding.
Thus TESTCODE (or a related algorithm) may be able to reduce the experimental
work in such cases to a single confirmatory experiment. Second, when newly
sequenced DNA is found to contain an ORF of unknown function, TESTCODE may be
used to decide whether it is likely to code for a new protein. This could be
a powerful technique for discovering new proteins. One can even imagine the
day when semi-automated sequencing of entire genomes followed by computer
analysis of the results could fully catalogue the proteins of an organism. A
third use for TESTCODE is in checking the accuracy of the data in
computer-based sequence libraries. We discovered several errors in the Los
Alamos Library with the help of TESTCODE.

We think that TESTCODE will prove to be useful both to experimentalists
in their initial analysis of sequence data and to theoreticians as they learn
about the differences between coding and noncoding DNA. However we do not
claim to have discovered the ultimate coding/noncoding test. Indeed, the main
value of this paper as we see it is that it presents one method for
recognising coding sequences which is spelled out in complete detail and has
been tried out on a large collection of sequence data. Thus other people can
easily use TESTCODE and know how to interpret the results. We will gladly
make available our programs and data to anyone wishing to more fully develop
and test other methods. (They are available on-line to users of the Los
Alamos Library. Others may request a tape by mail.)

Research on TESTCODE-like algorithms is complementary to several other 
lines of research. For example on the one hand TESTCODE only has a resolution 
of 200 bases and can not pinpoint the exact boundaries of a PCS, while on the 
other hand methods for recognizing signals for the initiation of 
transcription, initiation of translation, and intron splicing are poorly 
developed and require additional confirmation; thus these two methods can 
profitably be combined. Also, since TESTCODE is completely insensitive to 
phase, it can only be used to tell when a region is coding, and not what the 
coding frame is. This limitation can usually be overcome by combining 
TESTCODE with a search for ORF's, but when two ORF's overlap in different 
phases, another method is needed to decide which is the correct one. This can 
very likely be done using published methods mentioned in the introduction 
(3-6). Users of TESTCODE should be aware of one other point: we have not 
checked TESTCODE on regions of mixed coding/noncoding character. Thus it 
would be best to apply TESTCODE to regions that will be either fully coding or 
fully noncoding, for example ORF's starting at the last probable fMET codon. 
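
The ambiguous case described above, two ORF's that overlap in different 
phases, can be detected mechanically before a phase-insensitive test is 
applied. The following is a minimal sketch only. The names Orf, phase and 
ambiguous_pairs are illustrative and do not come from the paper, and the 
sketch assumes all ORF's lie on the same strand, so that the phase of an ORF 
is simply its start position modulo 3. 

```python
from itertools import combinations
from typing import NamedTuple


class Orf(NamedTuple):
    start: int  # 0-based index of the first base
    end: int    # index one past the last base


def phase(orf: Orf) -> int:
    # Assumption: same strand for all ORFs, so phase is start modulo 3.
    return orf.start % 3


def ambiguous_pairs(orfs):
    """Return pairs of ORFs that overlap while lying in different phases."""
    pairs = []
    for a, b in combinations(orfs, 2):
        overlap = min(a.end, b.end) - max(a.start, b.start)
        if overlap > 0 and phase(a) != phase(b):
            pairs.append((a, b))
    return pairs


# Hypothetical coordinates, chosen only to illustrate the check.
orfs = [Orf(0, 300), Orf(250, 640), Orf(700, 1000)]
flagged = ambiguous_pairs(orfs)
```

Pairs returned by such a check are the ones for which a phase-sensitive 
method would still be needed to choose the correct frame. 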

There is some interesting regularity in the errors that TESTCODE makes. 
In coding sequences which are incorrectly classified as noncoding it often 
seems that some use is being made of the DNA which causes the usual codon 
preference rules to be overridden. For example one of two overlapped viral 
genes is sometimes classified as noncoding. Also, variable regions of 
immunoglobulin genes often are rated noncoding, presumably because the 
mechanism which generates diversity of these regions is stronger than whatever 
force encourages consistent codon preference. A very interesting example 
pertains to the yeast mating type loci. The four presumptive PCS's there are 
rated noncoding - possibly this means that some other pattern is present in 
this region of the DNA which is necessary to enable transposition. 

Proc Int Conf Intell Syst Mol Biol. 1994;2:354-62. 

The prediction of human exons by oligonucleotide 
composition and discriminant analysis of spliceable open 
reading frames. 



Solovyev VV , Salamov AA , Lawrence CB . 

Department of Cell Biology, Baylor College of Medicine, Houston, TX 
77030, USA. 

Discriminant analysis is applied to the problem of recognition of 
5'-, internal and 3'-exons in human DNA sequences. Specific 
recognition functions were developed for revealing exons of 
particular types. The method is based on a splice site prediction 
algorithm that uses the linear Fisher discriminant to combine 
the information about significant triplet frequencies of various 
functional parts of splice site regions and preferences of 
oligonucleotides in protein coding and intron regions 
(Solovyev, Lawrence, 1994). The accuracy of our splice site 
recognition function is about 97%. A discriminant function for 
5'-exon prediction includes hexanucleotide composition of 
upstream region, triplet composition around the ATG codon, 
ORF coding potential, donor splice site potential and 
composition of downstream intron region. For internal exon 
prediction, we combine in a discriminant function the 
characteristics describing the 5'-intron region, donor splice 
site, coding region, acceptor splice site and 3'-intron region 
for each open reading frame flanked by GT and AG base pairs. 
The accuracy of precise internal exon recognition on a test set 
of 451 exon and 246693 pseudoexon sequences is 77% with 
a specificity of 79% and a level of pseudoexon ORF prediction 
of 99.96%. The recognition quality computed at the level of 
individual nucleotides is 89% for exon sequences and 98% for 
intron sequences. A discriminant function for 3'-exon 
prediction includes octanucleotide composition of upstream 
intron region, triplet composition around the stop codon, ORF 
coding potential, acceptor splice site potential and 
hexanucleotide composition of downstream region. (ABSTRACT 
TRUNCATED AT 250 WORDS) 

PMID: 7584412 [PubMed - indexed for MEDLINE] 
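
The figures quoted in this abstract can be cross-checked with a few lines of 
arithmetic. This is only a sketch under stated assumptions: it takes the 77% 
figure to be the fraction of the 451 true exons that are recovered exactly, 
takes 99.96% to be the fraction of the 246693 pseudoexons that are correctly 
rejected, and takes "specificity" to mean true positives divided by all 
predictions, a usage that is common in gene-finding work but is not spelled 
out in the abstract. 

```python
# Numbers as quoted in the abstract.
n_exons = 451
n_pseudoexons = 246693
sensitivity = 0.77            # assumed: fraction of true exons recovered
pseudoexon_rejection = 0.9996  # assumed: fraction of pseudoexons rejected

# Expected counts under those assumptions.
true_pos = sensitivity * n_exons
false_pos = (1.0 - pseudoexon_rejection) * n_pseudoexons

# Assumed definition of "specificity": TP / (TP + FP).
precision = true_pos / (true_pos + false_pos)
```

Under these assumptions the roughly 100 false positives left after rejecting 
99.96% of the pseudoexons are enough to bring the fraction of correct 
predictions down to a little under 80%, which is in line with the quoted 
specificity of 79%. 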

J Biochem (Tokyo). 1986 Jun;99(6):1579-90. 

Determination of the initiation sites of transcription and 
translation of the uvrD gene of Escherichia coli. 

Yamamoto Y, Ogawa T, Shinagawa H, Nakayama T, 
Matsuo H, Ogawa H. 

Prior to the analysis of transcription and translation, the 
nucleotide sequence of the uvrD gene and its neighboring 
regions was determined by the method of Maxam and Gilbert 
(Maxam & Gilbert (1980) Methods Enzymol. 65, 499-560). 
Disagreement in 14 positions between the nucleotide 
sequence determined by us and that reported previously 
(Finch & Emmerson (1984) Nucl. Acids Res. 11, 5789-5799) 
was found. We reexamined these disputed regions. The 
initiation site of transcription of the uvrD gene was 
determined by analyzing the transcripts synthesized in vitro. 
It was found that transcription of the uvrD gene starts from 
the A nucleotide, which is the first one of the SOS box of the 
uvrD. The amino terminal sequence and the amino acid 
composition of the purified UvrD protein (helicase II) were 
determined. It was found that translation starts from the first 
ATG codon, which lies 77 nucleotides downstream from the 
initiation site of transcription. The amino acid composition of 
the purified UvrD protein agreed well with that deduced from 
the nucleotide sequence. 

PMID: 2943729 [PubMed - indexed for MEDLINE] 

Abstract 



We have developed a hierarchical rule base system for identifying genes in DNA sequences. 
Atomic sites (such as initiation codons, stop codons, acceptor sites and donor sites) are 
identified by a number of different methods and evaluated by a set of filters and rules chosen 
to maximize sensitivity; these are combined into higher-order gene elements (such as exons), 
evaluated, filtered and combined as equivalence classes into probable genes, which are 
evaluated and ranked. The system has been tested on an extensive collection of vertebrate 
genes smaller than 15,000 bases. Results obtained show that, on average, 88% of the 
predicted coding region for a transcription unit is actually coding, and 80% of the actual 
coding is correctly predicted. This will, in most applications, be sufficient for a search 
against protein sequence databases for the identification of probable gene function. In 
addition, the system provides a general test platform for both gene atomic site identification 
and the rules for their evaluation and assembly. 

Author Keywords: gene identification; exon structure; intron splicing; coding sequence; 
artificial intelligence 

Abstract 

A new method which predicts internal exon sequences in human DNA has be[...] 
method is based on a splice site prediction algorithm that uses the linear discri[...] 
combine information about significant triplet frequencies of various functiona[...] 
regions and preferences of oligonucleotides in protein coding and intron regio[...] 
splice site recognition function is 97% for donor splice sites and 96% for acce[...] 
exon prediction, we combine in a discriminant function the characteristics des[...] 
region, donor splice site, coding region, acceptor splice site and 3'-intron regi[...] 
reading frame flanked by GT and AG base pairs. The accuracy of precise inte[...] 
on a test set of 451 exon and 246693 pseudoexon sequences is 77% with a sp[...] 
recognition quality computed at the level of individual nucleotides is 89% for [...] 
98% for intron sequences. This corresponds to a correlation coefficient for ex[...] 
The precision of this approach is better than other methods and has been test[...] 
We have also developed a means for predicting exon-exon junctions in cDNA [...] 
be useful for selecting optimal PCR primers. 

The prediction of exons through an analysis of 
spliceable open reading frames. 



Hutchinson GB , Hayden MR . 

Department of Medical Genetics, University of British 
Columbia, Vancouver, Canada. 

We have developed a computer program which 
predicts internal exons from naive genomic 
sequence data and which will run on any IBM- 
compatible 80286 (or higher) computer. The 
algorithm searches a sequence for 'spliceable 
open reading frames' (SORFs), which are open 
reading frames bracketed by suitable splice- 
recognition sequences, and then analyzes the 
region for codon usage. Potential exons are 
stratified according to the reliability of their 
prediction, from confidence levels 1 to 5. The 
program is designed to predict internal exons of 
length greater than 60 nucleotides. In an analysis 
of 116 genes of a training set, 384 out of 441 
such exons (87.1%) are identified, with 280 
(63.5%) of predictions matching the true exon 
exactly (at both 5' and 3' splice junctions and in 
the correct reading frame), and with 104 (23.6%) 
exons matching partially. In a similar analysis of 
14 genes in a test set unrelated to the genes used 
to generate the parameters of the program, 70 
out of 80 internal exons greater than 60 bp in 
length are identified (87.5%), with 47 completely 
and 23 partially matched. SORFs that partially 
match true internal exons share at least one 
splice junction with the exon, or share both splice 
junctions but are interpreted in an incorrect 
reading frame. Specificity (the percentage of 
SORFs that correspond to true exons) varies from 
91% at confidence level 1 to 16% at confidence 
level 5, with an overall specificity of 35-40%. The 
output displays nucleotide position, confidence 
level, reading frame phase at the 5' and 3' ends, 
acceptor and donor sequences and scoring 
statistics and also gives an amino acid translation 
of the potential exon. SORFIND compares 
favourably with other programs currently used to 
predict protein-coding regions. 
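
The idea of a "spliceable open reading frame" described in this abstract can 
be illustrated with a short sketch. It is not the SORFIND program: it accepts 
any AG dinucleotide as an acceptor and any GT dinucleotide as a donor, applies 
no splice-site scoring or codon-usage analysis, and assigns no confidence 
levels. The function names and the min_len default (exons longer than 60 
nucleotides, as in the abstract) are choices made for this illustration. 

```python
STOPS = {"TAA", "TAG", "TGA"}


def open_frames(segment):
    """Return the frames (0, 1, 2) in which the segment has no stop codon."""
    frames = []
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOPS for codon in codons):
            frames.append(frame)
    return frames


def spliceable_orfs(seq, min_len=61):
    """Yield (start, end, frames) for segments bracketed by AG ... GT.

    start is the first base after an AG, end is the index of the G of a GT,
    and frames lists the reading frames in which seq[start:end] is open.
    """
    seq = seq.upper()
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    for start in acceptors:
        for end in donors:
            if end - start >= min_len:
                frames = open_frames(seq[start:end])
                if frames:
                    yield start, end, frames


# Toy sequence: acceptor AG, 63 bases with no stop codon, donor GT.
toy = "CCAG" + "GCT" * 21 + "GTAA"
candidates = list(spliceable_orfs(toy))
```

On real genomic sequence such an unscored search returns far more candidates 
than true exons, which is why a scoring step of the kind the abstract 
describes is needed. 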

File: USPT 
Dec 15, 1992 

DOCUMENT-IDENTIFIER: US 5171844 A 

** See image for Certificate of Correction ** 

TITLE: Proteins with factor VIII activity: process for their preparation using 
genetically-engineered cells and pharmaceutical compositions containing them 

Drawing Description Text (5): 

a. The landmarks of the pSV2-derived vector: two tandemly situated promoters: the 
SV40 early transcription promoter (SVep) and the Rous Sarcoma Virus-Long Terminal 
Repeat (RSV-LTR); the capping site (cap site) and 5' end of the messenger RNA 
(mRNA); the cDNA insert bearing the full-length Factor VIII coding region with the 
start codon (ATG), the open reading frame and the stop codon (TGA); the 3' 
noncoding region of the mRNA with a short intron and the polyadenylation signal 
(polyA) derived from SV40 DNA (compare to FIG. 2). 

File: USPT 
Mar 2, 1993 

DOCUMENT-IDENTIFIER: US 5190756 A 

TITLE: Methods and materials for expression of human plasminogen variant 

Detailed Description Text (74): 

Mammalian expression vector pRK-tPA was prepared from pRK5 (described in EP 
307,247, supra, where the pCIS2.8c28D starting plasmid is described in EP 278,776 
published Aug. 17, 1988 based on U.S. Ser. Nos. 07/071,674 and 06/907,297) and from 
t-PA cDNA (Pennica et al., Nature, 301: 214 (1983)). The cDNA was prepared for 
insertion into pRK5 by cutting with restriction endonuclease HindIII (which cuts 49 
base pairs 5' of the ATG start codon) and restriction endonuclease BalI (which cuts 276 
base pairs downstream of the TGA stop codon). This cDNA was ligated into pRK5 
previously cut with HindIII and SmaI using standard ligation methodology (Maniatis 
et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, New 
York, 1982). This construct was named pRK-t-PA, and is shown in FIG. 2. 

Plasminogen variants are defined as molecules in which the amino acid sequence 
of native Pg has been modified, typically by a predetermined mutation, wherein 
at least one modification renders the plasminogen resistant to proteolytic 
cleavage to its two-chain form. Amino acid sequence variants of Pg include, 
for example, deletions from, or insertions or substitutions of, residues 
within the amino acid Pg sequence shown in FIG. 1. Any combination of 
deletion, insertion and substitution may also be made to arrive at the final 
construct, provided that the final construct possesses the desired resistance 
to cleavage and biological activity. Obviously, it is preferred that the 
mutations made in the DNA encoding the variant Pg do not place the sequence 
out of reading frame and it is further preferred that they do not create 
complementary regions that could produce secondary mRNA structure (see, e.g., 
European Patent Publication No. 075,444). 

The human .alpha. subunit cDNA was engineered for expression by digesting 
the full-length clone with NcoI, which spans the start ATG, and HindIII, which 
cleaves in the 3' untranslated region 215 base pairs (bp) downstream of the 
TAA stop codon. A 5' SalI site and Kozak consensus sequence (27) was provided 
by synthetic oligonucleotides, and a 3' SalI site by attaching linkers as 
described above. The DNA sequence of the engineered .alpha. subunit cDNA 
clone, which is approximately 600 bp in length, is shown in Table 7. This was 
inserted into the XhoI site of the CLH3AXSV2DHFR expression vector (FIG. 2). 
The endogenous 5' untranslated region and 3' polyadenylation signal were 
removed from the cDNA clone in the process of engineering and therefore were 
supplied by vector sequences: the MT-I promoter and the simian virus 40 (SV40) 
early polyadenylation signal, respectively. 

TITLE: Newcastle disease virus gene clones 

Detailed Description Text (42): 

Referring first to the F gene cDNA and proceeding in the 5' to 3' direction, the 
F.sub.o-coding region is thought to extend from the proposed ATG start codon at 
nucleotides 47-49 to a TGA stop codon at 1706-1708. The cDNA encodes the F.sub.o 
polypeptide which is cleaved in vivo to F.sub.2, F.sub.1 (F.sub.2 being to the 
5'-end of the F.sub.o gene cDNA, F.sub.1 to the 3'-end). Cleavage occurs at the 
C-terminal side of the arginine encoded by nucleotides 392-394. The amino acid 
sequence after the proposed cleavage site, viz that encoded by nucleotides 395-454, 
is the same as that of the 20 amino acids at the N-terminal of F.sub.1 determined 
by C. D. Richardson et al., supra. Beyond the end of the F.sub.1-coding sequence 
is a non-coding portion corresponding to the 3' end of the mRNA which then 
terminates in a poly-A sequence at nucleotides 1787-1792. 
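
The nucleotide coordinates quoted in this paragraph can be checked for 
internal consistency with simple codon arithmetic. The sketch below uses only 
the positions given in the text (1-based, with the proposed ATG starting at 
nucleotide 47). The helper name codon_number is introduced here for 
illustration. 

```python
START = 47          # first base of the proposed ATG (1-based)
STOP_FIRST = 1706   # first base of the TGA stop codon


def codon_number(first_base):
    """1-based index of the codon starting at first_base, or None if out of frame."""
    offset = first_base - START
    if offset % 3 != 0:
        return None
    return offset // 3 + 1


# Number of codons from the ATG up to, but not including, the stop codon.
coding_codons = (STOP_FIRST - START) // 3

# Codon encoding the arginine at the proposed cleavage site (nucleotides 392-394).
arg_codon = codon_number(392)

# Residues encoded by nucleotides 395-454, inclusive.
n_terminal_residues = (454 - 395 + 1) // 3
```

The arithmetic gives 553 codons before the stop, places the arginine at codon 
116, and confirms that nucleotides 395-454 encode 20 residues, in agreement 
with the 20 N-terminal amino acids mentioned in the text. 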

43 The DNA sequence shows five significant potential asparagine-linked 
glycosylation sites in F.sub.o, one (NRT) in F.sub.2 at 299-307 and four (NKT, 
NTS, NIS and NNS) in F.sub.1 at 617-625, 1142-1150, 1385-1393 and 1457-1465. 
The NNT site near the C-end of F.sub.1 is considered insignificant since it 
lies in the region of the protein which does not cross the membrane. 

44 The amino acid sequence of the HN polypeptide gene is shown with an ATG start 
codon at nucleotides 1915-1917 and a TAG stop codon at nucleotides 3646-3648; 
this is followed by a 177-nucleotide non-coding region which terminates in a 
poly-A sequence at the 3' end of the mRNA. The DNA sequence shows six 
potential glycosylation sites in HN, (NNS, NDT, NKT, NHT, NPT, NKT) at 2269- 
2277, 2935-2943, 3211-3219, 3355-3363, 3412-3420 and 3526-3534. 
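
Potential asparagine-linked glycosylation sites of the kind listed above 
follow the N-X-S/T tripeptide pattern, and their nucleotide coordinates follow 
from the position of the start codon. The sketch below is illustrative only: 
the function name, the cds_start parameter and the allow_proline switch are 
assumptions of this example. Proline in the middle position is commonly 
treated as disallowed, but the text lists NPT among the potential sites, so 
the sketch keeps such matches by default. 

```python
import re

# N-X-S/T sequon; the lookahead lets overlapping matches be reported.
SEQUON = re.compile(r"(?=(N[A-Z][ST]))")


def find_sequons(protein, cds_start=1, allow_proline=True):
    """Return (tripeptide, first_nt, last_nt) for each N-X-S/T match.

    cds_start is the 1-based nucleotide position of the first base of the
    start codon, so coordinates are reported on the nucleotide sequence.
    """
    hits = []
    for match in SEQUON.finditer(protein.upper()):
        tripeptide = match.group(1)
        if not allow_proline and tripeptide[1] == "P":
            continue
        first_nt = cds_start + 3 * match.start()
        hits.append((tripeptide, first_nt, first_nt + 8))
    return hits


# Made-up protein with an NRT sequon at residue 85, start codon at nucleotide 47.
toy_protein = "M" + "A" * 83 + "NRTAA"
toy_hits = find_sequons(toy_protein, cds_start=47)
```

With the start codon at nucleotide 47, a sequon beginning at residue 85 maps 
to nucleotides 299-307, the same coordinates as the NRT site reported for 
F.sub.2 in the text. 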

45 The non-coding region encodes a potential glycosylation site (NQT) at 
3712-3720 and has a further TGA stop codon at 3757-3759, near the 3' end of 
the mRNA, which may provide an explanation for the origin of HN.sub.o in 
certain strains of NDV. 

46 The HN proteins of the NDV strains Ulster and Queensland are known to be 
synthesised in a precursor form (HN.sub.o) which is cleaved to active HN by 
the removal of a C-terminal glycopeptide. These considerations suggest that 
the gene encoding the HN.sub.o precursor for the HN protein of certain 
avirulent NDV strains may differ from the genes encoding the HN proteins of 
more virulent strains of NDV by mutations generating a longer open reading 
frame and the consequent synthesis of a larger HN polypeptide. 

47 Full length cDNA encoding the F and HN polypeptides 

File: USPT 
May 28, 1996 

DOCUMENT-IDENTIFIER: US 5520911 A 

TITLE: Variants of plasminogen activators and processes for their production 

Detailed Description Text (160): 

Plasmid pRK7 was used as the vector for generation of the t-PA mutants. This 
plasmid, described in EP 278,776 published Aug. 17, 1988, is identical to pRK5 (EP 
publication number 307,247 published 15 Mar. 1989), except that the order of the 
endonuclease restriction sites in the polylinker region between Cla I and Hind III 
is reversed. The t-PA cDNA (Pennica et al., Nature, 301: 214 (1983)) was prepared 
for insertion into the vector by cutting with restriction endonuclease Hind III 
(which cuts 49 base pairs 5' of the ATG start codon) and restriction endonuclease 
Bal I (which cuts 276 base pairs downstream of the TGA stop codon). This cDNA was 
ligated into pRK7 previously cut with Hind III and Sma I using standard ligation 
methodology (Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring 
Harbor Laboratory, New York, 1982). This construct was named pRK7-t-PA. 

File: USPT 
Mar 4, 1997 

DOCUMENT-IDENTIFIER: US 5608036 A 

TITLE: Enhanced secretion of polypeptides 

Detailed Description Text (71): 

BDNFopt3 was prepared with polymer-supported synthesis using standard 
phosphoramidite chemistry methods. Due to the length of BDNFopt3, the gene was 
synthesized as four separate segments: segment 1 is 104 bases and contains some 5' 
untranslated sequence corresponding to vector sequence, an XbaI restriction site, 
an ATG start codon, and the first 76 bases of the BDNFopt3 nucleic acid sequence; 
segment 2 contains the next 117 bases of BDNFopt3; segment 3 contains the next 107 
bases of BDNFopt3; and segment 4 contains the remaining 57 bases of BDNFopt3 along 
with the TAA stop codon, a BamHI restriction site sequence, and 5 additional 
nucleotides. The segments were ligated together using standard ligation protocols. 
Prior to ligation, three oligonucleotides were hybridized to the BDNFopt3 gene 
fragments to ensure that the four gene segments would be ligated together in the 
proper order. Each of the three oligonucleotides used spans one of the BDNFopt3 
gene fragment junctions. The nucleic acid sequence of each of these 
oligonucleotides is set forth below: ##STR1## 

File: USPT 
Sep 29, 1992 

DOCUMENT-IDENTIFIER: US 5151267 A 

TITLE: Bovine herpesvirus type 1 polypeptides and vaccines 

Detailed Description Text (104): 

The gI gene maps between 0.422 and 0.443 genome equivalents (FIGS. 4 and 5), which 
is within the BHV-1 HindIII A fragment described by Mayfield et al. (1983), supra. 
A KpnI plus AccI partial digestion of the HindIII A fragment produces a 3255 base 
pair (bp) subfragment which contains the entire gI gene coding sequence. DNA 
sequence analyses placed an AccI site 20 bp 5' to the ATG start codon, while the 
KpnI site is 420 bp 3' to the TGA stop codon. This fragment was inserted into a 
synthetic DNA polylinker present between the EcoRI and SalI sites of pBR328 (i.e., 
ppol26, not shown) to produce pgB complete (FIG. 8). To this end, the AccI 
asymmetric end of the 3255 bp fragment was first blunted with Klenow enzyme and the 
gI fragment was then ligated to the HpaI plus KpnI sites of ppol26 to give pgB 
complete. HpaI and KpnI sites are within the polylinker of ppol26 and are flanked 
respectively by a BglII and a BamHI site. The gI gene was then transferred from pgB 
complete as a 3260 bp BglII+BamHI fragment to the BamHI site of the vaccinia virus 
insertion vector pGS20 (FIG. 9) to generate pgBvax (plasmid pGS20 with gI gene). 
Moss et al. in Gene Amplification and Analysis, Vol. 3, pp. 201-213 (Papas et al. 
eds. 1983). 

File: PGPB 
Jun 29, 2006 

DOCUMENT-IDENTIFIER: US 20060142949 A1 

TITLE: System, method, and computer program product for dynamic display, and 
analysis of biological sequence data 

Description of Disclosure: 

[0105] Some embodiments of biological sequence tools 212 may include another tool 
of pane 430 that may be available for analyzing a loaded or user selected sequence 
region for what is commonly referred to as an open reading or translation frame. 
Typically, for what are referred to as eukaryotes, three nucleotide bases typically 
code for each translated protein base. The three nucleotide bases are commonly 
referred to as a codon that may be read by a cell's translation machinery in what 
is commonly referred to as the translation or reading frame. Each sequence of DNA 
has six possible reading frames, three in each direction. Typically, only one 
reading frame codes for a protein and is referred to as the open reading frame. As 
is known to those of ordinary skill in the related art, the open reading frame 
typically begins with what is referred to as a start codon, and ends with a stop 
codon. The open reading frame analysis tool may be accessible by a user selection of 
ORF tab 1505 as illustrated in FIG. 15. Upon selection of tab 1505, ORF scale bar 
1520 may be displayed in ORF selectable field 1510. In some implementations, the 
scale bar may represent a selectable minimum size of the ORF to be identified in 
loaded sequence 1407 or selected sequence such as, for instance, selected sequence 
1430 of FIG. 14. User 101 may interactively select a value represented on scale bar 
1520 by moving ORF scale tab 1525, via commonly used methods such as clicking and 
dragging with a mouse, to the desired position along scale bar 1520. In the 
illustrated implementation, scale bar 1520 may use a variety of different 
incremental scales, such as for instance numbers of base residues, as well as what 
is referred to by those of ordinary skill in the related art as kilobases, 
megabases, centimorgans, or other incremental value used for sequence measurement. 
In some embodiments, tab 1525 may be set to some default value that could 
correspond to an average ORF size or some other value. A selection of analyze ORF 
button 1515 instructs generator 210 to find one or more open reading frames in a 
loaded sequence or user selection of sequence, using the user selected criteria of 
scale tab 1525. GUI manager 211 may return the results to the user in a variety of 
formats that could include one or more colored boxes displayed in sequence coordinates 
pane 425 aligned with the one or more identified ORF's of sequence residues 1425. 
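
The kind of search this paragraph describes, open reading frames in all six 
reading frames with a user-selected minimum size, can be sketched in a few 
lines. This is not the implementation of the disclosed tool. The function 
names, the min_len parameter (standing in for the value chosen on the scale 
bar) and the use of ATG and the three standard stop codons are assumptions 
made for illustration. 

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_len):
    """Yield (frame, start, end) for ATG...stop ORFs at least min_len bases long."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:   # length includes the stop codon
                    yield frame, start, i + 3
                start = None


def find_orfs(seq, min_len=30):
    """Search all six reading frames; coordinates are relative to the strand read."""
    seq = seq.upper()
    results = []
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(text, min_len):
            results.append((strand, frame, start, end))
    return results


# Toy sequence: one 36-base ORF (ATG, ten GCT codons, TAA) in forward frame 2.
example = "CC" + "ATG" + "GCT" * 10 + "TAA" + "CC"
found = find_orfs(example)
```

Raising min_len plays the role of moving the scale tab to a larger minimum 
ORF size: with min_len=40 the toy ORF above would no longer be reported. 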

[0049] In accordance with some implementations, some targets hybridize with probes and remain at the 
probe locations, while non-hybridized targets are washed away. These hybridized targets, with their tags 
or labels, are thus spatially associated with the probes. The term "hybridization" refers to the process in 
which two single-stranded polynucleotides bind non-covalently to form a stable double-stranded 
polynucleotide. The term "hybridization" may also refer to triple-stranded hybridization, which is 
theoretically possible. The resulting (usually) double-stranded polynucleotide is a "hybrid". The 
proportion of the population of polynucleotides that forms stable hybrids is referred to herein as the 
"degree of hybridization". Hybridization probes usually are nucleic acids (such as oligonucleotides) 
capable of binding in a base-specific manner to a complementary strand of nucleic acid. Such probes 
include peptide nucleic acids, as described in Nielsen et al., Science 254:1497-1500 (1991) or Nielsen, 
Curr. Opin. Biotechnol., 10:71-75 (1999) (both of which are hereby incorporated herein by reference), and 
other nucleic acid analogs and nucleic acid mimetics. The hybridized probe and target may sometimes be 
referred to as a probe-target pair. Detection of these pairs can serve a variety of purposes, such as to 
determine whether a target nucleic acid has a nucleotide sequence identical to or different from a 
specific reference sequence. See, for example, U.S. Pat. No. 5,837,832, referred to and incorporated 
above. Other uses include gene expression monitoring and evaluation (see, e.g., U.S. Pat. No. 5,800,992 
to Fodor, et al., U.S. Pat. No. 6,040,138 to Lockhart, et al., and International App. No. PCT/US98/15151, 
published as WO99/05323, to Balaban, et al.), genotyping (U.S. Pat. No. 5,856,092 to Dale, et al.), or 
other detection of nucleic acids. The '992, '138, and '092 patents, and publication WO99/05323, are 
incorporated by reference herein in their entireties for all purposes. 

[0050] The present invention also contemplates signal detection of hybridization between probes and 
targets in certain preferred embodiments. See U.S. Pat. Nos. 5,143,854, 5,578,832, 5,631,734, 5,936,324, 
5,981,956, 6,025,601 incorporated above and in U.S. Pat. Nos. 5,834,758, 6,141,096, 6,185,030, 
6,201,639, 6,218,803, and 6,225,625, in U.S. Patent application 60/364,731 and in PCT Application 
PCT/US99/06097 (published as WO99/47964), each of which also is hereby incorporated by reference 
in its entirety for all purposes. 

[0051] A system and method for efficiently synthesizing probe arrays using masks is described in U.S. 
patent application Ser. No. 09/824,931, filed Apr. 3, 2001, that is hereby incorporated by reference 
herein in its entirety for all purposes. A system and method for a rapid and flexible microarray 
manufacturing and online ordering system is described in U.S. Provisional Patent Application Ser. No. 
60/265,103 filed Jan. 29, 2001, that also is hereby incorporated herein by reference in its entirety for all 
purposes. Systems and methods for optical photolithography without masks are described in U.S. Pat. 
No. 6,271,957 and in U.S. patent application Ser. No. 09/683,374 filed Dec. 19, 2001, both of which are 
hereby incorporated by reference herein in their entireties for all purposes. 

[0052] As noted, various techniques exist for depositing probes on a substrate or support. For example, 
"spotted arrays" are commercially fabricated, typically on microscope slides. These arrays consist of 
liquid spots containing biological material of potentially varying compositions and concentrations. For 
instance, a spot in the array may include a few strands of short oligonucleotides in a water solution, or it 
may include a high concentration of long strands of complex proteins. The Affymetrix.RTM. 417.TM. 
Arrayer and 427.TM. Arrayer are devices that deposit densely packed arrays of biological materials on 
microscope slides in accordance with these techniques. Aspects of these and other spot arrayers are 
described in U.S. Pat. Nos. 6,040,193 and 6,136,269 and in PCT Application No. PCT/US99/00730 
(International Publication Number WO 99/36760) incorporated above and in U.S. patent application Ser. 
No. 09/683,298 hereby incorporated by reference in its entirety for all purposes. Other techniques for 
generating spotted arrays also exist. For example, U.S. Pat. No. 6,040,193 to Winkler, et al. is directed to 
processes for dispensing drops to generate spotted arrays. The '193 patent, and U.S. Pat. No. 5,885,837 to 
Winkler, also describe the use of micro-channels or micro-grooves on a substrate, or on a block placed 
on a substrate, to synthesize arrays of biological materials. These patents further describe separating 
reactive regions of a substrate from each other by inert regions and spotting on the reactive regions. The 
'193 and '837 patents are hereby incorporated by reference in their entireties. Another technique is based 
on ejecting jets of biological material to form a spotted array. Other implementations of the jetting 
technique may use devices such as syringes or piezo electric pumps to propel the biological material. It 
will be understood that the foregoing are non-limiting examples of techniques for synthesizing, 
depositing, or positioning biological material onto or within a substrate. For example, although a planar 
array surface is preferred in some implementations of the foregoing, a probe array may be fabricated on 
a surface of virtually any shape or even a multiplicity of surfaces. Arrays may comprise probes 
synthesized or deposited on beads, fibers such as fiber optics, glass, silicon, silica or any other 
appropriate substrate, see U.S. Pat. No. 5,800,992 referred to and incorporated above and U.S. Pat. Nos. 
5,770,358, 5,789,162, 5,708,153 and 6,361,947 all of which are hereby incorporated in their entireties 
for all purposes. Arrays may be packaged in such a manner as to allow for diagnostics or other 
manipulation in an all inclusive device, see for example, U.S. Pat. Nos. 5,856,174 and 5,922,591 hereby 
incorporated in their entireties by reference for all purposes. 
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[0057] Methods and apparatus for signal detection and processing of intensity data are disclosed in, for 
example, U.S. Pat. Nos. 5,143,854, 5,578,832, 5,631,734, 5,800,992, 5,834,758, 5,856,092, 5,936,324, 
5,981,956, 6,025,601, 6,090,555, 6,141,096, 6,185,030, 6,201,639, 6,207,960, 6,218,803, 6,225,625, in 
PCT Application PCT/US99/06097 (published as W099/47964) incorporated above, and m U.S. Pat. 
Nos. 5,547,839, 5,902,723, 6,171,793, 6,207,960, 6,252,236, 6,335,824, 6,490,533, 6,472,671, 
6,403,320, and 6,407,858 each of which is hereby incorporated by reference in its entirety for all 
purposes Other scanners or scanning systems are described in U.S. patent application Ser. No. 
09/682,837 filed Oct. 23, 2001, Ser. No. 09/683,216 filed Dec. 3, 2001, Ser. No. 09/683,217 filed Dec. 
3, 2001, Ser. No. 09/683,219 filed Dec. 3, 2001, and Ser. No. 10/389,194, filed Mar. 14, 2003, each of 
which is hereby incorporated by reference in its entirety for all purposes 

[0058] The present invention may also make use of various computer program products and software for 
a variety of purposes, such as probe design, management of data, analysis, and instrument operation. See 
U.S. Pat. Nos. 5,593,839, 5,795,716, 5,974,164, 6,090,555, 6,188,783 incorporated above and U.S. Pat. 
Nos. 5,733,729, 6,066,454, 6,185,561, 6,223,127, 6,229,911 and 6,308,170, hereby incorporated herein 
in their entireties for all purposes. 

[0059] Scanner 185 provides data representing the intensities (and possibly other characteristics, such as 
color) of the detected emissions, as well as the locations on the substrate where the emissions were 
detected. The data typically are stored in a memory device, such as system memory 120 of user computer 
100, in the form of a data file or other data storage form or format. One type of data file, such as image 
data file 212 shown in FIG. 2, typically includes intensity and location information corresponding to 
elemental sub-areas of the scanned substrate. The term "elemental" in this context means that the 
intensities, and/or other characteristics, of the emissions from this area each are represented by a single 
value. When displayed as an image for viewing or processing, elemental picture elements, or pixels, 
often represent this information. Thus, for example, a pixel may have a single value representing the 
intensity of the elemental sub-area of the substrate from which the emissions were scanned. The pixel 
may also have another value representing another characteristic, such as color. For instance, a scanned 
elemental sub-area in which high-intensity emissions were detected may be represented by a pixel 
having high luminance (hereafter, a "bright" pixel), and low-intensity emissions may be represented by a 
pixel of low luminance (a "dim" pixel). Alternatively, the chromatic value of a pixel may be made to 
represent the intensity, color, or other characteristic of the detected emissions. Thus, an area of 
high-intensity emission may be displayed as a red pixel and an area of low-intensity emission as a blue pixel. 
As another example, detected emissions of one wavelength at a particular sub-area of the substrate may 
be represented as a red pixel, and emissions of a second wavelength detected at another sub-area may be 
represented by an adjacent blue pixel. Many other display schemes are known. Two examples of image 
data are data files in the form *.dat or *.tif as generated respectively by Affymetrix® Microarray 
Suite or Affymetrix® GeneChip® Operating Software based on images scanned from 
GeneChip® arrays, and by Affymetrix® Jaguar™ software based on images scanned from 
spotted arrays. 
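
By way of non-limiting illustration, the two display schemes described above (a luminance scheme with "bright" and "dim" pixels, and a chromatic scheme running from blue to red) may be sketched as follows. This is a minimal sketch only: the function names, the threshold of 128, the 0 to 255 intensity range, and the sample grid are assumptions made for the example and are not taken from the scanner or software described herein.

```python
from typing import List, Tuple

# Assumed cut-off between "bright" and "dim"; the text gives no value.
BRIGHT_THRESHOLD = 128


def luminance_label(intensity: int, threshold: int = BRIGHT_THRESHOLD) -> str:
    """Classify one elemental sub-area by its single intensity value."""
    return "bright" if intensity >= threshold else "dim"


def intensity_to_rgb(intensity: int, max_intensity: int = 255) -> Tuple[int, int, int]:
    """Chromatic scheme: low intensity maps to blue, high intensity to red."""
    if max_intensity <= 0:
        raise ValueError("max_intensity must be positive")
    # Clamp to the valid range, then scale to a 0..1 fraction.
    fraction = min(max(intensity, 0), max_intensity) / max_intensity
    red = round(255 * fraction)
    return (red, 0, 255 - red)


# Hypothetical 2 x 3 grid of elemental sub-areas, one value per sub-area.
scan: List[List[int]] = [[0, 40, 255], [200, 128, 10]]
labels = [[luminance_label(v) for v in row] for row in scan]
colors = [[intensity_to_rgb(v) for v in row] for row in scan]

if __name__ == "__main__":
    for label_row, color_row in zip(labels, colors):
        print(label_row, color_row)
```

Each sub-area keeps exactly one intensity value, consistent with the meaning of "elemental" given above; only the rendering of that value differs between the two schemes.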

[0060] Probe-Array Analysis Applications 199: Generally, a human being may inspect a printed or 
displayed image constructed from the data in an image file and may identify those cells that are bright or 
dim, or are otherwise identified by a pixel characteristic (such as color). However, it frequently is 
desirable to provide this information in an automated, quantifiable, and repeatable way that is 
compatible with various image processing and/or analysis techniques. For example, the information may 
be provided for processing by a computer application that associates the locations where hybridized 
targets were detected with known locations where probes of known identities were synthesized or 
deposited. Other methods include tagging individual synthesis or support substrates (such as beads) using 
chemical, biological, electro-magnetic transducers or transmitters, and other identifiers. Information such 
as the nucleotide or monomer sequence of target DNA or RNA may then be deduced. Techniques for 
making these deductions are described, for example, in U.S. Pat. No. 5,733,729 and in U.S. Pat. No. 
5,837,832, noted and incorporated above. 
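
As a non-limiting illustration of the association step described above, the following minimal sketch joins detected intensities to a table of known probe locations. The data layout (dictionaries keyed by row and column), the function name `identify_hybridized_probes`, the probe sequences, and the threshold are all hypothetical and are not drawn from the referenced patents or from any particular analysis application.

```python
from typing import Dict, List, Tuple

Location = Tuple[int, int]  # (row, column) of a cell on the array


def identify_hybridized_probes(
    probe_layout: Dict[Location, str],
    intensities: Dict[Location, float],
    threshold: float,
) -> List[Tuple[Location, str, float]]:
    """Return (location, probe identity, intensity) for each cell whose
    detected intensity meets the threshold and whose probe is known."""
    hits = []
    for location, value in sorted(intensities.items()):
        if value >= threshold and location in probe_layout:
            hits.append((location, probe_layout[location], value))
    return hits


# Hypothetical layout: where probes of known identity were deposited.
probe_layout = {(0, 0): "ACGTACGT", (0, 1): "ACGAACGT", (1, 0): "TTGCACCA"}
# Hypothetical scan result: one intensity value per detected location.
intensities = {(0, 0): 912.0, (0, 1): 87.5, (1, 0): 640.0, (1, 1): 999.0}

hits = identify_hybridized_probes(probe_layout, intensities, threshold=500.0)

if __name__ == "__main__":
    for location, probe, value in hits:
        print(location, probe, value)
```

In this sketch a location with a strong signal but no entry in the layout, such as (1, 1), is skipped rather than guessed at, so that every reported hit corresponds to a probe of known identity.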
In addition to the above general procedures which can be used for preparing 
recombinant DNA molecules and transformed unicellular organisms in accordance 
with the practices of this invention, other known techniques and modifications 
thereof can be used in carrying out the practice of the invention. In 
particular, techniques relating to genetic engineering have recently undergone 
explosive growth and development. Many recent U.S. patents disclose 
plasmids, genetically engineered microorganisms, and methods of conducting 
genetic engineering which can be used in the practice of the present 
invention. For example, U.S. Pat. No. 4,273,875 discloses a plasmid and a 
process of isolating the same. U.S. Pat. No. 4,304,863 discloses a process for 
producing bacteria by genetic engineering in which a hybrid plasmid is 
constructed and used to transform a bacterial host. U.S. Pat. No. 4,419,450 
discloses a plasmid useful as a cloning vehicle in recombinant DNA work. U.S. 
Pat. No. 4,362,867 discloses recombinant cDNA construction methods and hybrid 
nucleotides produced thereby which are useful in cloning processes. U.S. Pat. 
No. 4,403,036 discloses genetic reagents for generating plasmids containing 
multiple copies of DNA segments. U.S. Pat. No. 4,363,877 discloses recombinant 
DNA transfer vectors. U.S. Pat. No. 4,356,270 discloses a recombinant DNA 
cloning vehicle and is a particularly useful disclosure for those with limited 
experience in the area of genetic engineering since it defines many of the 
terms used in genetic engineering and the basic processes used therein. U.S. 
Pat. No. 4,336,336 discloses a fused gene and a method of making the same. 
U.S. Pat. No. 4,349,629 discloses plasmid vectors and the production and use 
thereof. U.S. Pat. No. 4,332,901 discloses a cloning vector useful in 
recombinant DNA. Although some of these patents are directed to the production 
of a particular gene product that is not within the scope of the present 
invention, the procedures described therein can easily be modified to the 
practice of the invention described in this specification by those skilled in 
the art of genetic engineering. 
L10: Entry 48 of 48 

File: USPT 

Mar 8, 1994 

DOCUMENT-IDENTIFIER: US 5292658 A 

TITLE: Cloning and expressions of Renilla luciferase 

Detailed Description Text (63): 

The pTZRLuc-1 crude supernatants were further characterized by SDS-PAGE. The 
Coomassie-stained gel contained numerous bands, one of which ran in the vicinity of 
native luciferase. To confirm that this band was recombinant luciferase, Western 
analysis was performed using rabbit polyclonal antibodies raised against native 
Renilla luciferase. The developed Western showed one band that migrated at the same 
position as native luciferase. No other products indicative of 
β-galactosidase-luciferase fusion polypeptide were apparent, suggesting that either 
any putative fusion protein is in too low a concentration to be detected or, more 
likely, that no fusion protein is made. Though it has not been confirmed by DNA 
sequence analysis, any pTZRLuc-1 translation products initiating at the 
β-galactosidase ATG start codon within the first three codons immediately adjacent to 
the first cDNA start codon may explain why we see IPTG induction of luciferase 
activity without production of a fusion product. 
L10: Entry 46 of 48 

File: USPT 

Jan 2, 1996 

DOCUMENT-IDENTIFIER: US 5480972 A 

TITLE: Allergenic proteins from Johnson grass pollen 

Detailed Description Text (2): 

The present invention provides nucleic acid sequences coding for Sor h I, a major 
allergen found in Johnson grass pollen. The nucleic acid sequence coding for Sor h 
I preferably has the sequence shown in FIG. 5 (SEQ ID NO: 1). Sequence analysis of 
the Sor h I clone 3S revealed that the cDNA insert is 1072 nucleotides long and 
contains 3 possible in-frame ATG start codons at nucleotide positions 25, 37 and 
40. The ATG codon at position 40 is proposed as the site for translation 
initiation. This corresponds to an open reading frame of 783 nucleotides 
terminating with a TAA stop codon at position 823 and coding for a protein of 261 
amino acids. See FIG. 5 (SEQ ID NO: 1 and 2). A host cell transformed with a vector 
containing the cDNA insert of clone 3S has been deposited with the American Type 
Culture Collection ATCC No. 69106 on Oct. 28, 1992. 
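
The coordinates quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming that each quoted position is the 1-based position of the first base of its codon and that the 783-nucleotide figure excludes the stop codon; the function name `orf_summary` is hypothetical. Under those assumptions the figures in the text agree with one another.

```python
from typing import Dict


def orf_summary(start_pos: int, stop_pos: int) -> Dict[str, int]:
    """Summarise an open reading frame.

    start_pos: 1-based position of the first base of the ATG start codon.
    stop_pos:  1-based position of the first base of the stop codon.
    The coding length counts the start codon but not the stop codon.
    """
    coding_length = stop_pos - start_pos
    if coding_length <= 0 or coding_length % 3 != 0:
        raise ValueError("start and stop codons are not in the same frame")
    return {"coding_nucleotides": coding_length, "amino_acids": coding_length // 3}


# Values quoted in the text: ATG at 40, TAA at 823.
summary = orf_summary(40, 823)

# The three candidate ATG codons share a frame with the stop codon if the
# distance to the stop position is a multiple of three.
in_frame = {pos: (823 - pos) % 3 == 0 for pos in (25, 37, 40)}

if __name__ == "__main__":
    print(summary)
    print(in_frame)
```

With the start at position 40 the sketch gives 783 coding nucleotides and 261 amino acids, and all three candidate ATG positions fall in the same frame as the TAA at position 823.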
L36: Entry 13 of 13 

File: USPT 

Sep 29, 1992 

DOCUMENT-IDENTIFIER: US 5151267 A 

TITLE: Bovine herpesvirus type 1 polypeptides and vaccines 

Detailed Description Text (104): 

The gI gene maps between 0.422 and 0.443 genome equivalents (FIGS. 4 and 5), which 
is within the BHV-1 HindIII A fragment described by Mayfield et al. (1983), supra. 
A KpnI plus AccI partial digestion of the HindIII A fragment produces a 3255 base 
pair (bp) subfragment which contains the entire gI gene coding sequence. DNA 
sequence analyses placed an AccI site 20 bp 5' to the ATG start codon, while the 
KpnI site is 420 bp 3' to the TGA stop codon. This fragment was inserted into a 
synthetic DNA polylinker present between the EcoRI and SalI sites of pBR328 (i.e., 
ppol26, not shown) to produce pgB complete (FIG. 8). To this end, the AccI 
asymmetric end of the 3255 bp fragment was first blunted with Klenow enzyme and the 
gI fragment was then ligated to the HpaI plus KpnI sites of ppol26 to give pgB 
complete. HpaI and KpnI sites are within the polylinker of ppol26 and are flanked 
respectively by a BglII and a BamHI site. The gI gene was then transferred from pgB 
complete as a 3260 bp BglII+BamHI fragment to the BamHI site of the vaccinia virus 
insertion vector pGS20 (FIG. 9) to generate pgBvax (plasmid pGS20 with gI gene). 
Moss et al. in Gene Amplification and Analysis, Vol. 3, pp. 201-213 (Papas et al. 
eds. 1983). 
L36: Entry 12 of 13 

File: USPT 

Aug 16, 1994 

DOCUMENT-IDENTIFIER: US 5338683 A 

TITLE: Vaccinia virus containing DNA sequences encoding herpesvirus glycoproteins 

Detailed Description Text (45): 

DNA sequence analysis revealed an open reading frame extending from nucleotide 
positions 300 to 3239 reading from left to right relative to the EHV-1 genome, i.e. 
the ATG start codon was contained in the BamHI-a/EcoRI fragment and the stop codon 
TAA was contained in the BamHI-i fragment (3,59). 
L36: Entry 11 of 13 

File: USPT 

Jan 9, 1996 

DOCUMENT-IDENTIFIER: US 5482713 A 

TITLE: Equine herpesvirus recombinant poxvirus vaccine 

Detailed Description Text (45): 

DNA sequence analysis revealed an open reading frame extending from nucleotide 
positions 300 to 3239 reading from left to right relative to the EHV-1 genome, i.e. 
the ATG start codon was contained in the BamHI-a/EcoRI fragment and the stop codon 
TAA was contained in the BamHI-i fragment (3,59). 
L36: Entry 10 of 13 

File: USPT 

Feb 13, 1996 

DOCUMENT-IDENTIFIER: US 5491086 A 

TITLE: Purified thermostable nucleic acid polymerase and DNA coding sequences from 
pyrodictium species 

Detailed Description Text (77): 

The DNA sequence analysis of pPabl4 revealed an open reading frame of 803 amino 
acids having an ATG start codon at nucleotide position 869 and a TGA stop codon at 
nucleotide position 3280. The 5' end of the Pab gene was mutagenized with 
oligonucleotide primers AW397 (SEQ ID No. 5) and AW398 (SEQ ID No. 6) by PCR 
amplification (as described below). AW397 (SEQ ID No. 5) is a forward primer which 
was designed to alter the Pab DNA sequence at the ATG start to introduce an NdeI 
restriction site. Primer AW397 (SEQ ID No. 5) also introduced mutations in the 
fifth and sixth codons of the Pab polymerase gene sequence to be more compatible 
with the codon usage of E. coli, without changing the amino acid sequence of the 
encoded protein. The reverse primer, AW398 (SEQ ID No. 6), was chosen to include a 
SpeI site corresponding to amino acid position 174. In addition, a KpnI site was 
introduced after the SpeI site. 
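
The statement that codons can be altered "without changing the amino acid sequence" can be verified by translating the sequence before and after the change. The following is a minimal sketch using the standard genetic code. The two 18-base sequences are hypothetical and were chosen only so that the fifth and sixth codons differ; they are not the Pab gene sequence or primer AW397, and the function names are likewise assumptions.

```python
from itertools import product
from typing import List

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(seq: str) -> List[str]:
    """Split a DNA sequence into complete codons, starting at base 1."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq: str) -> str:
    """Translate a DNA sequence using the standard code ('*' marks a stop)."""
    return "".join(CODON_TABLE[c] for c in codons(seq))


def changed_codons(original: str, modified: str) -> List[int]:
    """Return 1-based indices of codons that differ between two sequences."""
    pairs = zip(codons(original), codons(modified))
    return [i for i, (a, b) in enumerate(pairs, start=1) if a != b]


# Hypothetical sequences: codons 5 and 6 changed from ATA AGA to ATC CGT.
original = "ATGGATCTGAAAATAAGA"
modified = "ATGGATCTGAAAATCCGT"

positions = changed_codons(original, modified)
is_silent = translate(original) == translate(modified)

if __name__ == "__main__":
    print(positions, translate(original), translate(modified), is_silent)
```

In this example the changes are confined to codons 5 and 6 and both sequences translate to the same peptide, which is the property a silent, codon-usage-motivated change of this kind is meant to have.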